We report on the search for the top quark in $p\bar{p}$ collisions at the
Fermilab Tevatron ($\sqrt{s}$ = 1.8 TeV) in the di-lepton and lepton+jets
channels using multivariate methods. An H-matrix analysis of the
$e\mu$ data corresponding to an integrated luminosity of 13.5 ± 1.6 pb$^{-1}$
yields one event whose likelihood to be a top quark event, assuming
$m_{top}$ = 180 GeV/c$^2$, is ten times that for WW and eighteen
times that for $Z \to \tau\tau$. A neural network analysis of the
e+jets channel using a data sample corresponding to an integrated
luminosity of 47.9 ± 5.7 pb$^{-1}$ shows an excess of events in the signal
region and yields a cross-section for $t\bar{t}$ production of 6.7 ± 2.3 (stat.)
pb, assuming a top mass of 200 GeV/c$^2$. An analysis of the e+jets data
using the probability density estimation method yields a cross-section
that is consistent with the above result.


INTRODUCTION 


The top quark, which remained elusive for over a decade and a half, has finally
been observed by both the CDF and D0 collaborations (1,2). The top quark
events have been observed in the di-lepton and lepton+jets decay modes of $t\bar{t}$
pairs produced in $p\bar{p}$ collisions at $\sqrt{s}$ = 1.8 TeV at the Fermilab Tevatron. The
collaborations have used conventional analysis methods to optimize cuts on
kinematic variables, together with the tagging of b quarks, to discriminate
top quark events from background. The conventional analysis methods do
not exploit correlations among the variables on which the cuts are applied
and thus may suffer a loss in signal efficiency. To improve the signal
efficiency, the D0 collaboration has been applying multivariate methods such
as the H-matrix, probability density estimation (PDE) and neural networks
to identify top quark events (3,4). In this paper, we describe the multivariate
methods used, present an analysis of the channel $t\bar{t} \to e\mu$ and report on
the measurement of the $t\bar{t}$ production cross-section from a study of the channel
$t\bar{t} \to$ e+jets.


MULTIVARIATE CLASSIFIERS


A classifier is any procedure that assigns objects to classes. In the present
context, a classifier separates signal events from the background. The
time-honored conventional classification methods, which examine univariate
(1-dimensional) and bivariate (2-dimensional) distributions of variables to
optimize cuts for separating signal and background events, do not in general
provide the maximum possible discrimination when correlations exist between
variables. Multivariate classifiers, which fully exploit the correlations that
exist among several variables, provide a discriminating boundary between signal
and background in multi-dimensional space that can yield discrimination close
to the theoretical maximum (Bayes' limit (5)).

In the multivariate approach, one encodes each event as a point in a
multi-dimensional space, called feature space, corresponding to a vector $x$ of feature
variables such as electron $E_T$ ($E_T^e$), neutrino $E_T$ ($\not\!\!E_T$), $H_T$ ($\sum E_T(\mathrm{jets})$), etc.
This feature space is then mapped into an output space of one or a few
dimensions in such a way that the signal and background vectors are mapped onto
different regions of the output space. The aim of the multivariate methods
is to reduce the dimensionality of the problem without losing information in
the process. The optimal way to partition the feature space into signal and
background regions is to choose the mapping to be the Bayes discriminant
function. Each cut on the value of the function corresponds to a discriminating
boundary in feature space. The Bayes discriminant function is simply the
ratio of the probability $P(s|x)$ that a given event is a signal event to the
probability $P(b|x)$ that it is a background event. It is written as


$$R(x) = \frac{P(s|x)}{P(b|x)} = \frac{P(x|s)\,P(s)}{P(x|b)\,P(b)} \qquad (1)$$


The quantities $P(x|s)$ and $P(x|b)$ are the likelihood functions for the signal and
background, respectively (hereafter denoted as $f(x)$ with or without an
appropriate subscript). The ratio of the prior probabilities, $P(s)/P(b)$, is the ratio of the
signal and background cross-sections. Some multivariate classifiers
approximate the likelihood functions, while the neural network classifier arrives at the
Bayesian probability for the signal, $P(s|x)$, without calculating the likelihood
functions for each class separately. The three classifiers being used by D0 are
described in the following sections.
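
A short worked step may help connect the Bayes discriminant to the signal probability that some classifiers output directly. Assuming only two classes, so that $P(s|x) + P(b|x) = 1$, the definition of the discriminant gives

$$R(x) = \frac{P(s|x)}{1 - P(s|x)}, \qquad P(s|x) = \frac{R(x)}{1 + R(x)}.$$

Because this map is monotonic, a cut on $P(s|x)$ selects the same events as the corresponding cut on $R(x)$.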


H-matrix Method

This is the familiar covariance matrix method, which is also known as the
Gaussian classifier. It was introduced in the 1930s (6,7) as a tool for
discriminating one class of feature vectors $x$ from another. The vector $x$ is assumed to
be distributed according to a multivariate Gaussian with covariance matrix
$M$ and mean $\bar{x}$. The likelihood function is therefore

$$f(x) = A \exp\Big\{-\tfrac{1}{2}\sum_{i,j}(x_i - \bar{x}_i)\,(M^{-1})_{ij}\,(x_j - \bar{x}_j)\Big\} = A \exp\big(-\tfrac{1}{2}\chi^2\big) \qquad (2)$$

where $H \equiv M^{-1}$ is the H-matrix. Fisher showed that the optimal way to
separate two overlapping multivariate Gaussian distributions with a common
covariance matrix but with different means $\bar{x}_s$ and $\bar{x}_b$ is to cut on the function

$$F = \tfrac{1}{2}\,(\chi_b^2 - \chi_s^2). \qquad (3)$$

$F$ is called the Fisher linear discriminant function. If the two distributions
have different correlation matrices, one can introduce a more general Gaussian
classifier (8), where the $\chi^2$ values are calculated using the corresponding
H-matrices as well as mean values for the signal and background classes. We note
that this method is useful even when the distribution of $x$ is non-Gaussian.

The Bayes discriminant function $R(x)$ can be written in terms of the Fisher
variable $F$ as $R(x) = \exp(F)$ when $P(s) = P(b)$.
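
The H-matrix recipe can be sketched in a few lines of Python. This is a minimal illustration, not the D0 implementation: it assumes NumPy, the function names (`h_matrix`, `chi2`, `fisher`) are hypothetical, and the toy samples are invented Gaussian numbers rather than detector data.

```python
import numpy as np

def h_matrix(sample):
    """Return (mean, H) for a sample of shape (n_events, n_features).

    H is the inverse of the sample covariance matrix M.
    """
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    return mean, np.linalg.inv(cov)

def chi2(x, mean, H):
    """chi^2 = sum_ij (x_i - mean_i) H_ij (x_j - mean_j), for one or many events."""
    d = np.atleast_2d(np.asarray(x, dtype=float)) - mean
    return np.einsum("ni,ij,nj->n", d, H, d)

def fisher(x, signal, background):
    """F = (chi2_b - chi2_s) / 2; `signal` and `background` are (mean, H) pairs."""
    return 0.5 * (chi2(x, *background) - chi2(x, *signal))

# Toy usage with invented samples sharing one covariance matrix.
rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
toy_signal = h_matrix(rng.multivariate_normal([1.0, 0.0], cov, size=2000))
toy_background = h_matrix(rng.multivariate_normal([0.0, 0.0], cov, size=2000))
F = fisher([[1.0, 0.0], [0.0, 0.0]], toy_signal, toy_background)
```

Events near the signal mean get positive $F$ and events near the background mean get negative $F$, so a single threshold on $F$ plays the role of the multi-dimensional cut.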


Probability Density Estimation (PDE) Method 


In the PDE method the likelihood functions, or probability density
functions (pdf's), are approximated by summing over multivariate kernel
functions, with one kernel function centered at each data point, for each of
the two classes of events. The expression for the likelihood function is


$$f(x) = \frac{1}{N_{events}\,h_1 \cdots h_d}\sum_{i=1}^{N_{events}}\prod_{j=1}^{d} K\!\left(\frac{x_j - x_{ij}}{h_j}\right) \qquad (4)$$


where the kernel function $K$ is chosen to be a Gaussian. The $j$-th variable is
denoted by $x_j$, and $x_{ij}$ denotes the $j$-th variable of the $i$-th event. By an appropriate
transformation, the variables $x_j$ are rendered uncorrelated within the signal
and background classes. The quantity $h_j$ is the smoothing parameter. We
use a single “global” smoothing parameter $h$ defined by

$$h_{sj} = h\,\sigma_{sj}, \qquad h_{bj} = h\,\sigma_{bj}, \qquad (5)$$

where $\sigma_{sj}$ and $\sigma_{bj}$ are the estimated standard deviations of the $j$-th variable for
the signal and background classes, respectively. The value of the smoothing
parameter $h$ is set by maximizing the signal to background ratio (S/B) at the
required signal efficiency.

The discriminant function we use in the PDE method is 


$$D(x) = \frac{f_s(x)}{f_s(x) + f_b(x)} \qquad (6)$$


where $f_s(x)$ and $f_b(x)$ are the pdf's for the signal and background classes of events,
respectively. The function $D(x)$ approximates the Bayesian probability for the
signal, $P(s|x)$. When $P(s) = P(b)$, the Bayes discriminant function becomes
$R(x) = D(x)/(1 - D(x))$.
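
The kernel estimate and the PDE discriminant can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code: it assumes NumPy, a normalized Gaussian kernel, and input variables that have already been decorrelated within each class; the function names are hypothetical.

```python
import numpy as np

def pde_density(x, sample, h):
    """Gaussian product-kernel estimate of f(x) for one point x.

    sample has shape (n_events, d); h is the global smoothing parameter,
    and the per-variable widths are h_j = h * sigma_j.
    """
    sample = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    h_j = h * sample.std(axis=0, ddof=1)
    u = (x - sample) / h_j                          # shape (n_events, d)
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernels.prod(axis=1).sum() / (len(sample) * h_j.prod())

def pde_discriminant(x, signal, background, h):
    """D(x) = f_s(x) / (f_s(x) + f_b(x))."""
    f_s = pde_density(x, signal, h)
    f_b = pde_density(x, background, h)
    return f_s / (f_s + f_b)
```

In practice `h` would be scanned, and the value giving the best S/B at the required signal efficiency kept, as described above.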


Neural Networks

Artificial neural networks provide a powerful new paradigm for event
classification. The most commonly used architecture in classification problems is
the multi-layer perceptron or feed-forward neural network. In Fig. 1, we show
the representation of a three-layer feed-forward neural network with one
hidden layer. The nodes in the input layer correspond to the components $x_k$ of
the feature vector $x$, and the output layer has a single node, as is common in
binary classification problems. The network builds an internal representation
of the mapping of the feature space into the output space. The output of the
network is given by

$$O(x) = g\Big(\sum_j w_j\, g\Big(\sum_k w_{jk}\,x_k + \theta_j\Big) + \theta\Big), \qquad (7)$$

where the “weights” $w_{jk}$ and $w_j$ and the “thresholds” $\theta_j$ and $\theta$ are parameters
that are adjusted during the “training” process. The quantity $g$ is a non-linear
“transfer” function of the form $g(y) = 1/(1 + e^{-2y})$. (Use of such transfer
functions enables the mapping of any real function (10).) The parameters are
determined by minimizing the mean square error between the actual output
$O^p$ and the desired output $t^p$,

$$E = \frac{1}{2N_p}\sum_{p=1}^{N_p}\,(O^p - t^p)^2, \qquad (8)$$

with respect to the parameters. Here $p$ denotes a feature vector or pattern.
Once the parameters are determined using a large number of signal and
background events, the network can be used to classify events. It has been shown
(11) that a feed-forward neural network, when trained as a classifier using
the back-propagation algorithm for updating the parameters, yields an output
that approximates the Bayesian probability for the signal, i.e., $O(x) \approx P(s|x)$.
(This assumes $t^p$ is 1 for signal and 0 for background.) The Bayes discriminant
in terms of the network output is $R(x) = O(x)/(1 - O(x))$.
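
The forward pass of such a one-hidden-layer network, and the conversion of its output into a Bayes discriminant, can be sketched as below. This is a hedged illustration, not JETNET code: it assumes NumPy, a sigmoid transfer function of the form $1/(1 + e^{-2y})$, and an error function normalized with a factor of one half; all function names are hypothetical.

```python
import numpy as np

def g(y):
    """Sigmoid transfer function (assumed form 1 / (1 + exp(-2y)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * y))

def network_output(x, w_hidden, theta_hidden, w_out, theta_out):
    """O(x) for one hidden layer: g(sum_j w_j g(sum_k w_jk x_k + theta_j) + theta)."""
    hidden = g(w_hidden @ x + theta_hidden)
    return g(w_out @ hidden + theta_out)

def bayes_discriminant(o):
    """R(x) = O(x) / (1 - O(x)), valid when O approximates P(s|x)."""
    return o / (1.0 - o)

def mean_square_error(outputs, targets):
    """E = (1 / 2N) * sum_p (O^p - t^p)^2 (the factor 1/2 is an assumption)."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return 0.5 * np.mean((outputs - targets) ** 2)
```

Training would adjust `w_hidden`, `theta_hidden`, `w_out` and `theta_out` to minimize `mean_square_error` over the signal (target 1) and background (target 0) patterns; that step is omitted here.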


ANALYSIS OF THE D0 DATA 


We have applied the multivariate methods to the analysis of the $t\bar{t} \to e\mu$
and $t\bar{t} \to$ e+jets channels. The on-line trigger selection, off-line electron and
muon identification criteria and a description of the variables used here can be found
elsewhere (12,13).

The $e\mu$ data analysed here correspond to an integrated luminosity of
13.5±1.6 pb$^{-1}$. The overall trigger efficiency is about (90±7)% for $m_{top}$ =
180 GeV/c$^2$ and varies slightly with top mass. In the off-line selection,
before analysis with the H-matrix method, we apply loose electron and muon
identification criteria and require $E_T^e$ > 11 GeV, $p_T^\mu$ > 11 GeV/c and at
least two reconstructed jets ($E_T^{jet}$ > 8 GeV). Using the H-matrix method we
have examined the signal to background ratio with respect to $WW \to e\mu$ and
$Z \to \tau\tau$ (the dominant backgrounds).

FIG. 1. A feed-forward neural network with one hidden layer.

We have used the PDE and neural network methods to analyze e+jets
data corresponding to an integrated luminosity of 47.9±5.7 pb$^{-1}$. The
dominant background to the $t\bar{t} \to$ e+jets channel is from the QCD production of
W+multi-jets, where the signature is the same as that of the signal, viz. a
high-$p_T$ electron, high $\not\!\!E_T$ arising from the leptonic decay of the W boson,
and several jets. In addition, we have background from QCD multi-jet events
where one of the jets is mis-identified as an electron and the event also has
high $\not\!\!E_T$ from mis-measurements as well as neutrinos from any heavy flavor
decays. We refer to this background as QCD fakes. The two backgrounds in
our data prior to the multivariate analyses are estimated directly from data.
The QCD fake background is estimated from the joint probability of multi-jet
events having $\not\!\!E_T$ larger than the cut applied and a jet being mis-identified
as an electron. The W+jets background is estimated using Berends' scaling (14).
The inclusive jet multiplicity data (after subtracting the QCD fakes
background) are fitted for this “jet-scaling”, allowing for a contribution from top
quark events. These background numbers are multiplied by the fraction of
events surviving the multivariate cuts to get the final background estimates.
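
The last two steps can be sketched numerically. The sketch below is a simplification and not the fit used in the analysis: it assumes that jet scaling means a constant ratio between successive inclusive jet multiplicities, it ignores the top quark contribution that the actual fit allows for, and all event counts in the example are hypothetical.

```python
import numpy as np

def fit_jet_scaling(min_jets, counts):
    """Fit N(>= n jets) = N0 * alpha**n by least squares in log space.

    Returns (N0, alpha). Simplified: no top-quark term is included.
    """
    n = np.asarray(min_jets, dtype=float)
    log_counts = np.log(np.asarray(counts, dtype=float))
    slope, intercept = np.polyfit(n, log_counts, 1)
    return np.exp(intercept), np.exp(slope)

def predict_w_jets(n_min, n0, alpha):
    """Extrapolated W+jets yield for events with at least n_min jets."""
    return n0 * alpha ** n_min

def final_background(n_before_cuts, surviving_fraction):
    """Background after the multivariate cuts."""
    return n_before_cuts * surviving_fraction

# Hypothetical counts for >= 1, 2, 3 jets (not D0 numbers).
n0, alpha = fit_jet_scaling([1, 2, 3], [1000.0, 200.0, 40.0])
w_4jets = predict_w_jets(4, n0, alpha)
w_4jets_after_cuts = final_background(w_4jets, 0.25)
```

The surviving fraction would come from passing the background sample through the multivariate selection; the value 0.25 here is purely illustrative.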


H-matrix Analysis of $e\mu$ Data

We have chosen the following feature vector $x = (E_T^e,\ p_T^\mu,\ \ldots,\ \not\!\!E_T^{cal},\ H_T,\ M_{e\mu},\ \Delta\phi_{e\mu})$
to build the signal and background H-matrices, where $\not\!\!E_T^{cal}$
is the $\not\!\!E_T$ in the calorimeter, $M_{e\mu}$ is the invariant mass of the two leptons
and $\Delta\phi_{e\mu}$ is the azimuthal angle between the two leptons. We have built
the signal H-matrix $H_{Top}$ using 180 GeV/c$^2$ $t\bar{t}$ ISAJET Monte Carlo events
(tt180) processed through the D0 detector simulation program. Since in the
$e\mu$ data we expect very few $t\bar{t} \to e\mu$ events, we have chosen to use the data
to model the background. The data consist largely of QCD $b\bar{b}$ events and
$W(\to\mu)$+jets+$e'$ events (where $e'$ denotes a jet that fakes an electron). We
have considered the $Z \to \tau\tau$ background separately to get better rejection
against that background. We define two Fisher discriminant functions

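$$F_1 = \tfrac{1}{2}\,(\chi^2_{Data} - \chi^2_{Top}), \qquad F_2 = \tfrac{1}{2}\,(\chi^2_{Z} - \chi^2_{Top}), \qquad (9)$$

where

$$\chi^2_{Data} = \sum_{i,j}(x_i - \bar{x}_i)\,(H_{Data})_{ij}\,(x_j - \bar{x}_j), \qquad \chi^2_{Z} = \sum_{i,j}(x_i - \bar{x}_i)\,(H_{Z})_{ij}\,(x_j - \bar{x}_j), \qquad (10)$$

$$\chi^2_{Top} = \sum_{i,j}(x_i - \bar{x}_i)\,(H_{Top})_{ij}\,(x_j - \bar{x}_j), \qquad (11)$$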

where $H_{Data}$ and $H_Z$ are the background H-matrices built using data and
$Z \to \tau\tau$ Monte Carlo events, respectively. The $\chi^2$ values, $F_1$ and $F_2$ are
determined for signal, backgrounds and data. In Fig. 2 we show the lego
plots of $F_1$ vs $F_2$ for each of the samples. By applying the cuts $F_1 > 15$ and
$F_2 > 3$ we have 16%, 22% and 25% efficiency for top events with top masses
of 140, 160 and 180 GeV/c$^2$, respectively. The only event that survives the
cuts is the same as that found in the conventional analysis (15). This event
lies in a region of phase space where the signal to background ratio (S/B)
is about 18 with respect to $Z \to \tau\tau$ and 10 with respect to WW for a 180
GeV/c$^2$ top quark.


PDE Analysis of e+jets Data

The PDE method has been applied to the e + ≥3 jets data (9). The
selection criteria used are $E_T^e$ > 20 GeV, $\not\!\!E_T$ > 20 GeV and at least 3 jets with
$E_T$ > 15 GeV. These five transverse energies define our feature vector in the
analysis. The two backgrounds are combined in the ratio estimated as in
the conventional analysis and are treated as a single background to build the
pdf. Figure 3 shows the distributions of the discriminant function $D(x)$ for
background events, 180 GeV/c$^2$ top quark events and D0 data. Applying
a cut of $D(x)$ > 0.8 yields 21 data events with an estimated background of
14.0±1.6 events in 47.9 pb$^{-1}$. The product of efficiency and branching ratio
for tt180 events is 3.1% (as compared to 1.8% for the conventional analysis). The
$t\bar{t}$ cross-section is calculated to be 4.7±3.3 (stat.) pb, in agreement with the
results from the conventional analysis.
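
As a check on the arithmetic, assume the cross-section is the background-subtracted yield divided by the product of efficiency times branching ratio and the integrated luminosity (this formula is not written out in the text, so it is an assumption). The numbers quoted above then give

$$\sigma_{t\bar{t}} = \frac{N_{obs} - N_{bkg}}{(\varepsilon \cdot BR)\,\int L\,dt} = \frac{21 - 14.0}{0.031 \times 47.9\ \mathrm{pb}^{-1}} \approx 4.7\ \mathrm{pb},$$

which reproduces the central value.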


FIG. 2. $F_1$ vs $F_2$ from the H-matrix analysis of the $e\mu$ channel for (a) D0 data ($\int L\,dt$ = 13.5
pb$^{-1}$), (b) $Z \to \tau\tau$ ($\int L\,dt$ = 3.1 fb$^{-1}$), (c) WW ($\int L\,dt$ = 22.3 fb$^{-1}$) and (d) tt180
($\int L\,dt$ = 20.1 fb$^{-1}$) samples.



FIG. 3. The PDE discriminant function for (a) background, (b) tt180 and (c) D0 data.










Neural Network Analysis of e+jets Data 


A discussion of a two-variable and a six-variable analysis of the e + ≥4 jets
data using neural networks has been presented previously. Here we present
results of the analyses including recent data. The neural network program
used here is JETNET 3.0 (16). We use the $E_T$ of the various measured objects
in the event ($E_T^e$, $\not\!\!E_T$, $E_T$ of jets), the event shape variable aplanarity ($A$) and
the total transverse energy $H_T$ of central jets (pseudorapidity $|\eta|$ < 2.0) to
discriminate the top signal events from the backgrounds. In our conventional
analysis of e + ≥4 jets using non-tagged data we have applied selection cuts of
$E_T^e$ > 20 GeV, $\not\!\!E_T$ > 25 GeV, $E_T$(jet4) > 15 GeV, $A$ > 0.05 and $H_T$ > 200 GeV.
(Jets are ordered in decreasing $E_T$; jet4 refers to the jet with the fourth highest
$E_T$.) For demonstration purposes, we compare in Fig. 4 the conventional
cuts on $A$ and $H_T$ with the contour cut obtained by a simple network with
2 input nodes, 2 hidden nodes and one output node. $A$ and $H_T$ are
used as inputs and the network is trained on tt180 and background events
(a mixture of W+jets and QCD fakes combined in the proper ratio). The
contour provides better signal efficiency than the conventional cuts for the
same signal to background ratio.




FIG. 4. $A$ and $H_T$ scatter plots for signal (Top 180) and background with conventional cuts
(lines parallel to the co-ordinate axes) and neural network cut (the contour).


In order to achieve higher signal efficiency we have relaxed the number of
jets required and have carried out an analysis of the e + ≥3 jets data. Also, we
use a five-dimensional feature vector $x = (E_T^e,\ \not\!\!E_T,\ H_T,\ A,\ E_T(\mathrm{jet3}))$. For this
analysis, we use two different networks to discriminate against the W+jets and
QCD fake backgrounds separately. That is, we train one network with tt180 as
signal and W+jets Monte Carlo events (using the VECBOS event generator)
as background and a second network with tt180 as signal and QCD fakes (data
events that fail the electron ID cuts) as background. We use networks with
5 input nodes (corresponding to the 5-dimensional feature vector), 5 hidden
nodes in one hidden layer and 1 output node. We use 1300 $t\bar{t}$ events, 1300
W+jets events and 590 QCD fake events for training. The testing is done on
2400 $t\bar{t}$ events (which include the 1300 events used for training) and on the
1300 W+jets events and 590 QCD fake events that were used for training. Training and
testing on the same set of events with the given sample size can give rise to
an uncertainty (≈10%) in the estimated background, which is included in the
systematic uncertainty. The target output of the network during training
is set to 1 for the signal and 0 for the background.



FIG. 5. Distributions of the output from the first network for (a) tt180, (b) W+jets
(VECBOS), (c) QCD fakes and (d) D0 data.


Figure 5 shows the output of the first network (NN1) for tt180, W+jets,
QCD fakes and D0 data. The distributions peak close to 1 for signal
events and close to 0 for background events, as expected. In Fig. 6 we
show the output distributions from the first network for data and for the
background (W+jets and QCD fakes combined) normalized to the number
of events expected in 47.9 pb$^{-1}$. We have estimated our background to be
(80±11)% W+jets and (20±5)% QCD fakes. (Errors are statistical only.)
The distributions in Fig. 6 are statistically consistent with each other in the
background region (NN1 close to 0) and we observe an excess of data events
in the signal region.

For the most part, the kinematic distributions of QCD fakes are similar to
those of W+jets and, therefore, all but a small part of the QCD fakes can be rejected with
the first network. However, to get better rejection of QCD fakes, we process
all samples through the second network. In Fig. 7 we show the output of the
second network (NN2) for signal, background and data events which satisfy
the cut NN1 > 0.7. Applying a cut NN2 > 0.5, we get a further factor of three
reduction in the QCD fake background.

FIG. 6. Comparison of outputs from the first network for D0 data and background.

We examine the distributions of the five input variables for data and
background in the region NN1 < 0.4, NN2 < 0.4, noting that only about 5% of the
tt180 events lie in that region. Given that the events in the region are mostly
background, we can check whether our background modeling is correct. The
distributions for data and the combined background are compared in Fig. 8. There
is good agreement between data and the background model.

Applying the cuts NN1 > 0.7 and NN2 > 0.5 yields 25 candidate events with
an estimated background of 10.1±1.5. This gives an excess over background
of 14.9±5.2 events. The 25 candidate events found here include most of the
non-tagged and $\mu$-tagged candidate events found by the conventional analysis.
The product of efficiency and branching ratio is 4.0% for tt180 and 4.6% for
tt200 (compared to 1.8% and 2.4%, respectively, for the conventional analysis).
For a top quark mass of 200 GeV/c$^2$, we obtain a $t\bar{t}$ production cross-section
of 6.7±2.3 pb. (For tt180, the cross-section obtained is 7.8±2.7 pb.) The errors
quoted are statistical only. A preliminary estimate of the systematic
uncertainty, which includes errors in background estimation and signal efficiency
(prior to multivariate analyses) and neural network specific uncertainties, is
about 30%. This is dominated by the first two components and work is in
progress to reduce these uncertainties.
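
The quoted excess and cross-sections can be cross-checked with a short calculation. This is a sketch under assumptions that the text does not spell out: the uncertainty on the excess is taken as the Poisson error on the observed count combined in quadrature with the background error, and the cross-section is the excess divided by efficiency times branching ratio times luminosity. The function names are hypothetical.

```python
import math

def excess(n_obs, n_bkg, bkg_err):
    """Excess over background and its uncertainty (quadrature sum, assumed)."""
    return n_obs - n_bkg, math.sqrt(n_obs + bkg_err ** 2)

def cross_section(n_obs, n_bkg, bkg_err, eff_times_br, lumi_pb):
    """Cross-section in pb and its statistical uncertainty."""
    signal, err = excess(n_obs, n_bkg, bkg_err)
    norm = eff_times_br * lumi_pb
    return signal / norm, err / norm

# Inputs quoted in the text for the neural network selection.
signal, signal_err = excess(25, 10.1, 1.5)
sigma_200, err_200 = cross_section(25, 10.1, 1.5, 0.046, 47.9)
sigma_180, err_180 = cross_section(25, 10.1, 1.5, 0.040, 47.9)
```

With the rounded inputs given in the text, these values agree with the quoted excess and cross-sections to within about 0.1 pb; the residual differences are consistent with rounding of the efficiencies.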


FIG. 7. Output of the second network (NN2) after requiring NN1 > 0.7 for (a) tt180,
(b) W+jets (VECBOS), (c) QCD fakes and (d) D0 data.



FIG. 8. Distributions of input variables compared for data (solid histograms)
and combined backgrounds (dashed histograms) after applying the cuts NN1 < 0.4 and
NN2 < 0.4 (anti-top cuts).


SUMMARY

We have applied multivariate analysis methods to search for top quark
events in the D0 data and we find a significant excess of events over
background. An H-matrix analysis of $e\mu$ data ($\int L\,dt$ = 13.5±1.6 pb$^{-1}$) yields
one candidate event (the same as found in the conventional analysis), which lies
in a phase space region where S/B = 10 with respect to WW and S/B
= 18 with respect to $Z \to \tau\tau$. Preliminary results from a PDE analysis
of the e + ≥3 jets data are consistent with results from the conventional
analysis. A preliminary neural network analysis of e + ≥3 jets data yields
$\sigma_{t\bar{t}}(m_{top} = 200\ \mathrm{GeV}/c^2)$ = 6.7±2.3 (stat.) pb, in agreement with our published
(2) results ($\sigma_{t\bar{t}}(m_{top} = 200\ \mathrm{GeV}/c^2)$ = 6.3±2.2 pb).

ACKNOWLEDGEMENTS 

We thank the Fermilab Accelerator, Computing and Research Divisions and 
support staffs at the collaborating institutions for their contributions to this 
work. 

This work is supported in part by the U.S. Department of Energy. 


REFERENCES 

1. F. Abe et al., (CDF Collaboration) Phys. Rev. Lett. 74, 2626 (1995). 

2. S. Abachi et al., (D0 Collaboration) Phys. Rev. Lett. 74, 2632 (1995). 

3. Pushpalatha C. Bhat (D0 Collaboration), FERMILAB-CONF-94-261-E. To be 
published in the proceedings of 1994 Meeting of the American Physical Society, 
Division of Particles and Fields (DPF 94), Albuquerque, NM, 2-6 Aug 1994. 

4. H.E. Miettinen (D0 Collaboration), Proc. AIHENP (1995). 

5. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, 
New York, 1973). 

6. R.A. Fisher, Annals Eugenics 7, 179 (1936).

7. P.C. Mahalanobis, Proc. Nat. Inst. Sci. India, Part 2A, 49 (1936). 

8. M. Kendall, A. Stuart and J.K. Ord, “The Advanced Theory of Statistics”, Vol. 
3, 4th ed., C. Griffin & Co. Ltd., London. 

9. L. Holmstrom, S.R. Sain and H.E. Miettinen, submitted to Comput. Phys.
Commun.

10. E.K. Blum and L.K. Li, Neural Networks 4, 511 (1991).

11. D.W. Ruck et al., IEEE Trans. Neural Networks 1, 296 (1990).

12. J. Bantly, these Proceedings. 

13. S. Abachi et al., (D0 Collaboration), FERMILAB-PUB-1995/020-E, submitted
to Phys. Rev. D.

14. F.A. Berends, H. Kuijf, B. Tausk and W.T. Giele, Nucl. Phys. B357, 32 (1991). 

15. S. Abachi et al., (D0 Collaboration), Phys. Rev. Lett. 72, 2138 (1994). 

16. JETNET 3.0 LUND Preprint, LU TP 93-29 (1993). 


